A Decade of Togetherness: Uncovering Sentiments and Trends in WhatsApp Group Chats¶

By Saurabh Kudesia | Aug 2025¶



© 2025 Saurabh Kudesia

This project is licensed under the MIT License. You are free to use, modify, and distribute this code, provided you include proper attribution and retain the license notice.

License: MIT

WhatsApp Group Chat

Image Courtesy: Unsplash.com

Background¶

In today's digital age, group chats have become dynamic reflections of social connection—capturing shared memories, humor, opinions, and emotional shifts. This project explores a WhatsApp group chat among batchmates to uncover patterns in communication, sentiment, and engagement over time.

By applying Natural Language Processing (NLP) and data visualization techniques, the analysis reveals how conversations evolve, which topics spark interaction, and how emotions ebb and flow in response to events. Ultimately, the project provides a data-driven glimpse into how a close-knit peer community communicates, bonds, and grows across a decade of digital dialogue.

Objectives¶

  • Analyze Sentiment Trends: Identify emotional tones (positive, negative, neutral) and how they shift over time.
  • Detect Conversation Peaks: Pinpoint days or events with unusually high activity or emotional intensity.
  • Identify Active Participants: Highlight the most engaged members based on message volume and frequency.
  • Discover Trending Topics: Use keyword analysis to uncover recurring themes and high-engagement discussions.
  • Visualize Communication Patterns: Track how message frequency and sentiment change over days, weeks, and months.
  • Understand Engagement Behavior: Analyze patterns such as peak activity hours, common media types, and link sharing habits.

Data Dictionary¶

  • Data Source: Exported WhatsApp chat file (.txt) from a group of batchmates.
  • Time Span: Messages exchanged from 1 Oct 2018 to 25 Jul 2025.
  • Message Types: Messages (28,671), images (500), videos (109), audio (6), contacts (18), documents (10), and spreadsheets (3).
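
As a rough illustration of how message-type counts like those above could be tallied, the sketch below classifies each message by the extension of an attached file. It assumes attachments appear in the export as "<filename> (file attached)". The EXTENSION_TYPES mapping, the message_type helper, and the sample strings are hypothetical and are not taken from the project's code.

```python
import re
from collections import Counter

# Hypothetical extension-to-type mapping; extend it to match the real export.
EXTENSION_TYPES = {
    "jpg": "image", "jpeg": "image", "png": "image", "webp": "image",
    "mp4": "video", "3gp": "video",
    "opus": "audio", "mp3": "audio", "m4a": "audio",
    "vcf": "contact",
    "pdf": "document", "docx": "document", "pptx": "document",
    "xlsx": "spreadsheet", "xls": "spreadsheet", "csv": "spreadsheet",
}

# Assumption: attachments are logged as "<filename> (file attached)".
ATTACHMENT_RE = re.compile(r"\.(\w+) \(file attached\)")


def message_type(text):
    """Return a coarse type label for one message body."""
    match = ATTACHMENT_RE.search(text)
    if match is None:
        return "message"  # plain text, links, deleted-message notices, etc.
    return EXTENSION_TYPES.get(match.group(1).lower(), "other")


# Made-up sample messages, used only to exercise the function.
sample_texts = [
    "Happy birthday!",
    "IMG-20200101-WA0001.jpg (file attached)",
    "VID-20200101-WA0002.mp4 (file attached)",
    "budget.xlsx (file attached)",
    "This message was deleted",
]
type_counts = Counter(message_type(t) for t in sample_texts)
print(type_counts)
```

Keeping the mapping in a single dictionary makes it easy to audit which file extensions were counted under each media type.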

Terminology & Conventions¶

This project uses the following icons and labels to highlight key types of supplementary information:

  • 💡 Data Insights: Highlights observations, best practices, or useful tips related to the data.
  • 🚫 Limitations: Indicates known constraints, issues, or areas where the data or analysis may be incomplete or unreliable.
  • 🔍 Analytical Insights: Explains reasoning behind a method, interpretation of results, or patterns uncovered through analysis.

Analytical Framework & Toolchain Overview¶

This project employs a multi-layered analytical framework that blends classic data science, modern Natural Language Processing (NLP), and deep learning—underscoring the complexity and depth of the methodology. Rather than relying on a single technique or toolkit, the analysis unfolds across several broad and interdependent themes:

  • Data Engineering & Parsing: Raw WhatsApp exports are unstructured and noisy. Through a combination of file system operations, hashing, regex parsing, and structured data tools like pandas, the project performs intensive preprocessing to convert years of chaotic conversation into analyzable formats. This foundation supports all subsequent layers of analysis.

  • Linguistic Processing & Text Normalization: Handling multilingual, emoji-laden, colloquial chat data is non-trivial. The project uses a pipeline of language detection, stemming, stopword filtering, and emoji interpretation—alongside tools like NLTK and langdetect—to normalize text while preserving semantic nuance.

  • Emotional & Semantic Mapping: By combining rule-based sentiment models (SentimentIntensityAnalyzer) with vectorized text representations (TfidfVectorizer, SentenceTransformer), the project maps the emotional tone and thematic drift of group interactions over time. This adds psychological depth to the statistical backbone.

  • Machine Learning & Classification: Supervised learning models are trained to classify themes, detect user roles, and track topic clusters. Clustering (e.g., KMeans) and dimensionality reduction (PCA) are used to distill insights from high-dimensional data, revealing latent behavioral trends.

  • Multimodal Intelligence: Beyond text, the project integrates shared media. With state-of-the-art transformer models like BLIP and CLIP, it performs caption generation, image-text alignment, and semantic tagging, making this not just a chat analysis—but a multimodal communication study.

  • Narrative Visualization: Static plots, dynamic charts (plotly), word clouds, and annotated timelines collectively turn analytical output into intuitive visual narratives. This enables exploration of interactions not just numerically, but as evolving social stories.

In essence, this project fuses data engineering, NLP, machine learning, multimodal AI, and visual storytelling to decode the patterns of connection, sentiment, and behavior in long-term group chat data. The methodology reflects both computational sophistication and creative curiosity, transforming casual digital chatter into a meaningful social dataset.
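
To make the Data Engineering & Parsing layer more concrete, here is a minimal parsing sketch using only the standard library. It assumes one particular export layout ("DD/MM/YYYY, HH:MM - Sender: text"); real WhatsApp exports differ by platform and locale, so the patterns LINE_RE and STAMP_RE, the parse_export function, and the sample lines are illustrative assumptions rather than the project's actual parser.

```python
import re
from datetime import datetime

# Assumed layout: "01/10/2018, 09:15 - Sender: text". Adjust for other exports.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+): (.*)$")
# A timestamp prefix without "Sender:" is treated as a system notice.
STAMP_RE = re.compile(r"^\d{2}/\d{2}/\d{4}, \d{2}:\d{2} - ")


def parse_export(lines):
    """Convert raw export lines into a list of message dictionaries."""
    messages = []
    current = None  # message that continuation lines attach to
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            date_part, time_part, sender, text = match.groups()
            stamp = datetime.strptime(f"{date_part} {time_part}", "%d/%m/%Y %H:%M")
            current = {"timestamp": stamp, "sender": sender, "text": text}
            messages.append(current)
        elif STAMP_RE.match(line):
            current = None  # system notice: skip it and stop any continuation
        elif current is not None:
            # No timestamp prefix: continuation of a multi-line message.
            current["text"] += "\n" + line
    return messages


# Made-up sample lines in the assumed layout.
sample_lines = [
    "01/10/2018, 09:15 - Sender_001: Happy birthday!",
    "Many happy returns of the day",
    "01/10/2018, 09:18 - Sender_003 added Sender_004",
    "01/10/2018, 09:20 - Sender_002: Congratulations",
]
parsed = parse_export(sample_lines)
print(len(parsed), parsed[0]["text"])
```

The resulting list of dictionaries could then be passed to pandas.DataFrame for the downstream analysis steps.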

Data Constraints and Analytical Caveats¶

  • Incomplete Historical Coverage

    The dataset begins on 1 October 2018, corresponding with the author’s entry into the group. Consequently, interactions and trends between the group’s inception on 29 August 2015 and this start date are not captured in the analysis.

  • Data Currency & Scope

    The data is current as of 25 July 2025. The analysis includes only members who have been active (i.e., posted at least once) since the dataset began, so it does not reflect the full history or membership of the group over time.

  • User Identity Tracking Limitations

    WhatsApp identifies users by phone numbers, which can change over time. If a user has switched phone numbers during the analysis period, they may be counted as two distinct participants. As a result, the number of unique users identified in the dataset may not accurately match the actual number of distinct group members.

  • Admin Role Uniformity

    Since all members are designated as administrators, the analysis does not differentiate between admin and non-admin behavior. This limits the ability to explore leadership dynamics or role-based engagement patterns.

  • Lack of Demographic Information

    The analysis lacks access to demographic variables such as age, gender, location, or professional background, as this information is not available from WhatsApp. This constrains any exploration of how such factors may influence communication behavior or participation levels.

    In the absence of additional user-provided details, this analysis estimates the geographic distribution of users based solely on the country codes extracted from their phone numbers. While this provides a reasonable approximation of user location, it may not reflect actual residency or physical presence.

  • Multilingual Communication Complexity

    The group's communication includes English, Indian languages, and code-mixed content, which poses challenges for natural language processing (NLP). Sentiment analysis, keyword extraction, and topic modeling may have reduced accuracy due to code-switching, informal phrasing, and non-standard syntax.

  • Modeling Limitations

    The analytical tools employed may not accurately detect sarcasm, humor, irony, or subtle bias. As a result, some emotional tones or cultural nuances may be misinterpreted or overlooked in the analysis.

  • Contextual Interpretation

    This project is intended as an exploratory, informal analysis for insights and engagement—not as a formal audit or evaluation. The findings are not prescriptive and should not be interpreted as critiques of individual members or group governance.

  • Illustrative Use of Personal Examples

    While significant effort was made to anonymize all data to protect participant privacy, certain illustrative examples may include intentional references to known group content or expressions. These are used respectfully and strictly for contextual clarity, without judgment or bias.
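
The country-code approximation described under Lack of Demographic Information can be sketched as follows. This is a minimal illustration, not the project's implementation: the CALLING_CODES table is deliberately tiny and hypothetical, the country_from_number helper is an assumed name, and the phone numbers are made up.

```python
from collections import Counter

# Tiny illustrative lookup; a real analysis needs a complete calling-code table.
CALLING_CODES = {
    "1": "United States/Canada",
    "44": "United Kingdom",
    "65": "Singapore",
    "91": "India",
    "971": "United Arab Emirates",
}


def country_from_number(number):
    """Guess a country from the international calling code of a phone number."""
    number = number.strip()
    if not number.startswith("+"):
        return "Unknown"  # no international prefix to read
    digits = "".join(ch for ch in number if ch.isdigit())
    # Calling codes are one to three digits long; try the longest prefix first.
    for length in (3, 2, 1):
        country = CALLING_CODES.get(digits[:length])
        if country is not None:
            return country
    return "Unknown"


# Made-up numbers, used only to exercise the function.
numbers = ["+91 98765 43210", "+971 50 123 4567", "+1 415 555 0100", "98765 43210"]
distribution = Counter(country_from_number(n) for n in numbers)
print(distribution)
```

As the caveat above notes, the result describes where a number was registered, not where its owner currently lives.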

Executive Summary¶

This analysis offers a comprehensive view of a dynamic WhatsApp group spanning 6.8 years, capturing 28,671 messages from 63 unique contributors. The findings reflect a resilient, emotionally intelligent, and professionally active digital community, offering key insights for community management, engagement optimization, and scalable communication strategy.

Key Highlights¶

Community Engagement & Participation¶

  • High Activity: 28,671 messages, averaging 4,200+ messages annually.
  • Broad Involvement: 105% participation rate (63 participants vs. 60 current members) reflects sustained engagement across time, including former members.
  • Contributor Distribution: Top 20 users drive 60-70% of content, revealing a core-periphery structure with potential for middle and lower-tier activation.

Temporal & Behavioral Patterns¶

  • Peak Activity Windows: Business hours (10–11 AM, 3–5 PM) on weekdays dominate engagement, though weekend participation remains consistent.
  • Stable Over Time: No major drop-offs, with message volume aligned to group milestones and event-driven peaks.

Communication Style & Sentiment¶

  • Balanced Messaging: Combination of short, informal texts and long, context-rich posts fosters both agility and depth.
  • Positive Sentiment Dominance: Rich in appreciation, recognition, and celebration—reinforced by consistent emoji use (🎉, 🙏, 😊).
  • Low Conflict Zone: Minimal negative sentiment and high psychological safety mark a healthy, trust-based environment.

Cultural & Linguistic Identity¶

  • English-Led Multilingualism: English (85.7%) is dominant, but Hinglish and Hindi add regional nuance and relatability.
  • Code-Mixing Patterns: Informal and context-driven language use reflects an Indian professional and alumni network context.

Media & File Sharing¶

  • Rich Content Diversity: 500 images, 109 videos, and 10 documents enhance communication quality and engagement.
  • Professional & Social Balance: Shared media supports learning, collaboration, and celebration—reinforcing both knowledge and community bonds.

Influencers & Recognition¶

  • Certain individuals (e.g., Sender_036, Sender_011) are central to discussion and recognition, playing informal leadership roles. These contributors are key to tone-setting, morale boosting, and knowledge sharing.

Challenges Identified¶

  • Participation Inequality: Core group dominates activity; 33% of users show low engagement.
  • Content Overload Risk: High message volume may lead to information fatigue.
  • Dependence on Key Users: Heavy reliance on a few contributors for momentum and leadership.

Strategic Recommendations¶

Engagement Equity¶

  • Reactivate the bottom 20% through targeted outreach.
  • Empower mid-tier users with light leadership roles and recognition.
  • Prevent burnout in top contributors via rotating responsibilities.

Communication Optimization¶

  • Schedule key updates during peak hours.
  • Guide discussion flow with prompts and structured topics.
  • Maintain language inclusivity with English for clarity and Hinglish/Hindi for cultural resonance.

Culture & Sentiment¶

  • Reinforce positivity through visible appreciation and shared celebration.
  • Monitor emotional tone to detect early signs of disengagement or group fatigue.

Knowledge & Media Management¶

  • Standardize and tag valuable shared content.
  • Promote high-signal media (infographics, short videos, documents).
  • Summarize key threads to improve retention and accessibility.

Analytics & Governance¶

  • Introduce dashboards and KPIs (e.g., participation equality, engagement per user).
  • Upgrade sentiment and language detection tools for deeper insights.
  • Establish privacy-respecting feedback loops and ethical data practices.

Strategic Implications¶

  • The group serves as a model for high-performing digital communities, with relevance for alumni networks, professional forums, or remote teams.
  • There is strong scalability potential, provided participation is balanced, content is curated, and engagement is guided by data.
  • Investing in analytics, recognition systems, and inclusive leadership will be critical for sustainable growth.

Conclusion¶

This WhatsApp group exemplifies mature digital community dynamics—where emotional intelligence, knowledge exchange, and cultural authenticity intersect. With deliberate action to address participation disparities and content organization, the group can evolve into a replicable blueprint for resilient, scalable, and inclusive online communities.

Exploratory Data Analysis¶

Group Dynamics & Participation¶


What Percentage of Participants post?¶

⚠️ Active participants exceed total declared participants. 

🚫 Limitations

The higher number of active_participants reflects the full history of group activity, not just the present membership snapshot.

While it may seem counterintuitive, the number of unique active participants (i.e., message senders) in a WhatsApp group can exceed the current total group members. This usually happens due to one or more of the following reasons:

  • Outdated or Manually Entered Group Size: The participants count may reflect the current or recent group size, but the chat export typically includes historical messages, including those from users who have since left the group. These past participants are still counted as active senders.
  • Multiple Sender IDs for the Same Person: If a participant changed their phone number or left and rejoined, WhatsApp may log them as separate users, increasing the count of unique senders.
  • System or Non-member Messages: Message logs sometimes include system messages (e.g., "You added X"), temporary participants or WhatsApp Business accounts, and unknown numbers not saved in contacts. All of these can be mistakenly counted as unique senders even if they are not current members.
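
The arithmetic behind a participation rate above 100% can be sketched as below. The participation_rate helper is a hypothetical name, and the sender list is synthetic: it simply reproduces the 63 unique senders and 60 current members reported in this analysis.

```python
def participation_rate(senders, current_members):
    """Unique message senders divided by the current group size."""
    unique_senders = {s.strip() for s in senders if s and s.strip()}
    return len(unique_senders) / current_members


# Synthetic sender column: 63 distinct IDs, each appearing twice.
senders = [f"Sender_{i:03d}" for i in range(1, 64)] * 2
rate = participation_rate(senders, current_members=60)
print(f"{rate:.0%}")
```

Because the numerator counts everyone who ever posted while the denominator is a present-day snapshot, the ratio can exceed 1 without indicating an error in the data.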

Who are the most and least active participants?¶


At what times is the group most active?¶


When do the top 10 posters usually post?¶


How has group activity changed over time?¶


How has each user's activity changed year over year?¶


💡 Data Insights

  • Group Composition

    Out of a current group size of 60 members, there have been 63 unique active participants, reflecting a 105% participation rate. This surplus is likely due to three previously active members who have since left the group. With an average of 455 messages per active participant, the group demonstrates a high level of meaningful and sustained engagement.

  • Participation Distribution

    A Pareto-like concentration is evident: the top 20 users (about a third of the 63 senders) account for approximately 60-70% of all messages. This highlights a significant engagement disparity between highly active and minimally active members. The Top 20 Contributors serve as the core influencers, consistently driving discussions and sharing valuable insights. The Bottom 20 Contributors show low participation, representing an opportunity for re-engagement through targeted strategies.

  • Daily & Weekly Trends

    Engagement is heavily concentrated during business hours (9 AM - 6 PM), with peak activity observed in two key windows: Mid-morning (10-11 AM) and Late afternoon (3-5 PM). Activity is strongest during weekdays (Monday to Friday), reflecting professional use patterns. A sustained weekend presence suggests that members maintain a flexible, work-life integrated approach to participation.

  • Long-Term Trends

    Over the course of years, message volume has remained consistently strong, indicating community durability and long-term engagement. Event-driven spikes coincide with major updates, discussions, or announcements. The absence of significant drop-off points to the group's relevance and continued value over time.

  • User Behavior Segmentation

    • Top Contributors: These individuals drive the community's pulse, contributing 60-70% of overall activity. They frequently share high-value content, lead discussions, and influence group culture. Their continued participation is critical, making them a strategic priority for retention and recognition.
    • Moderate Contributors: Serving as the reliable backbone of the group, moderate contributors engage regularly without dominating conversations. As growth candidates, they present ideal targets for subtle nudges to deepen participation.
    • Low Contributors: Comprising roughly 33% of the group, these underutilized members exhibit minimal activity. While their churn risk may be moderate, they represent a strategic opportunity for targeted re-engagement through personalized outreach or reactivation campaigns.
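
The concentration figures in these insights come down to two simple calculations: the share of messages sent by the k most active senders, and the mean number of messages per sender. The sketch below shows both. The top_k_share helper and the toy counts are hypothetical; only the totals 28,671 and 63 come from this analysis.

```python
def top_k_share(message_counts, k):
    """Fraction of all messages sent by the k most active senders."""
    counts = sorted(message_counts.values(), reverse=True)
    total = sum(counts)
    return sum(counts[:k]) / total if total else 0.0


# Toy per-sender counts, for illustration only.
toy_counts = {"A": 50, "B": 30, "C": 15, "D": 5}
share_top2 = top_k_share(toy_counts, k=2)

# Mean messages per active participant, using the reported totals.
mean_per_sender = 28671 / 63

print(f"top-2 share: {share_top2:.0%}, mean per sender: {mean_per_sender:.0f}")
```

Applied to the real per-sender counts with k=20, the same function would yield the 60-70% figure quoted above.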

Language & Communication Style¶


What languages are used in the group?¶

Detecting the language of WhatsApp messages poses unique challenges due to the informal, multilingual, and often code-mixed nature of the content. Many users switch between languages mid-sentence (e.g., Hindi and English), use transliterations (e.g., "tum kya kar rahe ho?"), and include slang, abbreviations, or emojis that confuse traditional language detection models. Standard tools like langdetect or rule-based approaches struggle with such content, often misclassifying short messages or defaulting to incorrect language predictions when messages are only a few words long or contain mixed-language constructs.

To address these limitations, we use the pre-trained fastText language identification model developed by Facebook AI Research. The lid.176.ftz model is the compressed variant of fastText's language identifier; it was trained on data from Wikipedia, Tatoeba, and SETimes and recognizes 176 languages. Because it relies on character n-gram features, it tends to cope with short, noisy text and varied spelling better than many traditional detectors, although transliterated and code-mixed messages remain difficult. In this project, fastText is used to improve the reliability of language tagging, which matters for downstream analysis such as sentiment detection, message categorization, and regional engagement insights.

🚫 Limitations

This analysis visualizes the distribution of languages used in WhatsApp group messages by leveraging a language detection model applied to each message. This provides a quick overview of the linguistic diversity in the group and highlights dominant languages used in conversations.

However, this method comes with notable limitations. WhatsApp messages are often short, informal, or filled with emojis, abbreviations, and spelling variations—all of which can confuse language detection models and reduce accuracy. Multilingual messages (common in informal chats) are often classified based on just a few dominant words, leading to oversimplification. Additionally, code-switching (switching languages mid-sentence) is not captured here, and the method assigns one language per message, which may not reflect the true linguistic blend.

Thus, while the chart gives a general sense of language use, it should be interpreted with caution, especially in multilingual or informal communication settings.


Which languages tend to appear together in posts?¶


Which languages are most frequently used?¶


How common is language-switching in messages?¶

🔍 Analytical Insights

To address the limitations of single-language detection in WhatsApp messages, we conduct a refined analysis that captures code-mixing and multilingual usage. For each non-emoticon message, the model returns a list of candidate languages with confidence scores, and we keep every language whose confidence exceeds 5%. This allows us to identify multiple languages used within a single message. The results offer a more nuanced view of language diversity.

Despite its improved granularity, this method has important limitations. The accuracy of language detection drops significantly for short or noisy texts—common in WhatsApp chats. It may also misclassify informal, transliterated, or regionally mixed language content (e.g., Hinglish, Taglish). Furthermore, it doesn't account for the order or structure of languages used within the message, and confidence-based thresholds may still include spurious detections.

Thus, while this approach captures multilingual tendencies better than single-label models, results should still be interpreted qualitatively alongside linguistic context.
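
The 5% confidence threshold described above can be sketched as follows. To keep the example runnable without downloading a model, the predictor is passed in as a parameter. The languages_above helper and the fake_predict stand-in (with made-up scores) are hypothetical; the only assumption about fastText is that its predict method accepts text and k and returns labels such as "__label__en" with matching probabilities.

```python
def languages_above(predict, text, k=3, min_conf=0.05):
    """Return (language, confidence) pairs whose confidence exceeds min_conf.

    predict is any callable shaped like fastText's predict(text, k=...):
    it returns (labels, probabilities), with labels like "__label__en".
    """
    # fastText predicts one line at a time, so newlines are flattened first.
    cleaned = text.replace("\n", " ").strip()
    if not cleaned:
        return []
    labels, probs = predict(cleaned, k=k)
    return [
        (label.replace("__label__", ""), float(prob))
        for label, prob in zip(labels, probs)
        if prob > min_conf
    ]


def fake_predict(text, k=1):
    """Stand-in predictor with made-up scores, so the sketch runs offline."""
    scored = (("__label__en", 0.62), ("__label__hi", 0.31), ("__label__ta", 0.02))[:k]
    return tuple(lbl for lbl, _ in scored), tuple(p for _, p in scored)


detected = languages_above(fake_predict, "kal milte hain at the office")
print(detected)
```

With fastText installed and lid.176.ftz downloaded, the predict method of the model returned by fasttext.load_model("lid.176.ftz") could be passed in place of fake_predict.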


What is each user’s preferred language?¶

This section identifies the primary language used by each participant based on the content of their messages. By analyzing the language distribution at the individual level, we can infer users' language preferences, which may reflect their background, communication style, or target audience. Understanding language preference enables more inclusive group analysis and helps uncover multilingual dynamics within the group.
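
One simple way to derive a primary language per participant is to take the most frequent language label across that participant's messages. The sketch below assumes each message already carries a single language label. The preferred_languages helper and the sample records are hypothetical, and this may differ from how the project computes preference.

```python
from collections import Counter, defaultdict


def preferred_languages(records):
    """Map each sender to the language label they use most often."""
    per_sender = defaultdict(Counter)
    for sender, language in records:
        per_sender[sender][language] += 1
    return {
        sender: counts.most_common(1)[0][0]
        for sender, counts in per_sender.items()
    }


# Made-up (sender, language) pairs, one per message.
records = [
    ("Sender_001", "en"), ("Sender_001", "en"), ("Sender_001", "hi"),
    ("Sender_002", "hi"), ("Sender_002", "hi"), ("Sender_002", "en"),
]
preferences = preferred_languages(records)
print(preferences)
```

A majority vote like this hides how close the runner-up language is, so per-sender proportions may be more informative for heavily code-mixed users.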


How long are messages on average?¶


How does message length vary year to year?¶


Who writes long messages and who keeps it short?¶


How verbose are the top 10 users over time?¶


💡 Data Insights

  • Multilingual Communication & Cultural Expression

    This analysis uncovers nuanced language usage patterns within the group, offering key insights into its cultural identity, communication flexibility, and community inclusivity. These are critical factors for content strategy and group management.

  • Language Usage Overview

    • Dominant Language: English is the primary medium, used in 85.7% of all detected messages—underscoring its role as the group’s default for clarity, professionalism, and broad accessibility.
    • Undetected Content (12.9%): A significant portion of messages could not be conclusively classified due to:
      • Informal abbreviations and slang
      • Mixed-language or hybrid expressions (e.g., Hinglish, Taglish)
      • Emoji-heavy content, links, or technical strings

    This highlights the limitations of automated NLP tools in decoding casual and non-standard digital communication.

  • Minority Language Presence
    • Hinglish (0.4%) and Hindi (0.3%) reflect regional flavor and informal expression, enhancing relatability among Indian users.
    • Multilingual/Other (0.02%) entries suggest isolated use of other languages, likely context-specific or user-driven.
  • Code-Switching Behavior

    The group demonstrates dynamic language switching, with members blending English and Indian languages—especially Hindi—to tailor tone, express cultural identity, or connect informally.

  • Hinglish Patterns

    The 95 Hinglish messages exemplify this informal, conversational code-mixing style, prevalent in Indian digital communities. It reinforces peer familiarity and social bonding, particularly in less formal exchanges.

  • Linguistic Adaptability

    Users show high fluency in adjusting language based on context, audience, and message intent—a sign of communication maturity and cultural sensitivity.

  • Communication Preferences

    • English dominates formal discussions, announcements, and informational sharing.
    • Hindi/Hinglish surfaces in celebratory, emotional, or casual interactions—indicating comfort-driven expression and social cohesion.
  • Inclusivity & Accessibility

    The multilingual environment enhances cultural authenticity while maintaining accessibility. Language diversity fosters a welcoming atmosphere, where participants can communicate in ways that feel natural and meaningful.

🚫 Analytical Limitations

Approximately 12.9% of messages remain unclassified/unknown due to:

  • Non-standard syntax or spelling
  • Emojis or image-only messages
  • Unconventional transliterations
  • Mixed-language structures beyond standard detection models

These constraints reflect the complexity of analyzing informal, multilingual digital discourse, suggesting a need for enhanced NLP models, manual validation, or context-aware review to fully capture the richness of such interactions.

Text & Word Analysis¶


How many words do messages typically contain?¶


What are the most frequently used words?¶


Which words does each user use the most?¶


Result Summary¶

SenderID | Plain_Top_Words | Normalized_Top_Words
Sender_001 | the, to, and, of, in, is, you, for, it, this | happy, one, https, india, birthday, good, many, day, people, please
Sender_002 | the, wishes, birthday, to, in, and, is, of, it, this | wishes, birthday, congratulations, mahesh, sardar, one, sunil, best, chinese, security
Sender_003 | the, to, in, and, of, https, com, is, file, attached | https, com, file, attached, www, img, jpg, happy, india, birthday
Sender_004 | the, to, and, of, in, you, is, it, for, with | happy, one, said, bday, https, time, man, day, good, many
Sender_005 | the, to, and, of, in, for, is, this, you, it | https, many, register, please, com, iimb, iimbaa, time, us, day
Sender_006 | the, you, very, to, thanks, wish, happy, and, of, birthday | thanks, wish, happy, birthday, lot, congratulations, research, india, adani, good
Sender_007 | the, to, and, of, is, in, for, it, be, this | https, one, com, india, day, us, people, also, good, like
Sender_008 | the, to, and, is, of, in, for, you, it, will | https, one, com, term, happy, birthday, www, long, know, get
Sender_009 | the, to, of, and, in, is, that, you, was, it | happy, birthday, india, one, us, day, many, years, https, world
Sender_010 | the, of, happy, and, day, in, to, many, returns, for | happy, day, many, returns, congratulations, thanks, discount, one, https, india
Sender_011 | the, to, you, and, of, is, in, it, birthday, happy | birthday, happy, thanks, wish, https, one, bhai, world, please, com
Sender_012 | to, and, the, on, in, for, of, you, happy, great | happy, great, day, years, birthday, last, modi, chinna, news, looks
Sender_013 | the, of, and, in, to, you, happy, is, day, many | happy, day, many, returns, wishes, best, thank, italy, chinese, congratulations
Sender_014 | the, to, and, of, in, for, you, your, happy, with | happy, birthday, please, us, thanks, one, https, time, share, com
Sender_015 | the, to, it, happy, in, and, is, of, this, birthday | happy, birthday, thanks, bde, guys, one, https, message, people, com
Sender_016 | the, to, and, of, in, is, for, this, it, you | birthday, happy, https, india, please, people, one, thanks, us, com
Sender_017 | the, of, to, in, and, is, this, with, that, for | india, us, also, karti, oxygen, company, one, iran, government, companies
Sender_018 | the, to, and, is, in, of, you, that, he, it | match, https, would, true, work, awesome, sure, old, please, com
Sender_019 | to, and, of, in, the, is, we, for, this, will | india, https, wk, happy, covid, birthday, free, please, one, www
Sender_020 | the, to, and, of, in, is, you, for, it, this | happy, birthday, people, day, one, https, us, would, get, time
Sender_021 | the, and, to, happy, birthday, in, is, of, you, it | happy, birthday, stay, blessed, congratulations, always, day, mahesh, thank, many
Sender_022 | the, to, in, vs, and, of, june, is, for, you | vs, june, com, india, risk, https, www, water, lemon, news
Sender_023 | is, the, to, and, happy, you, in, bday, it, of | happy, bday, thanks, good, please, one, https, com, get, okay
Sender_024 | the, to, of, and, in, is, you, that, it, this | https, time, people, sars, com, virus, one, business, take, india
Sender_025 | the, to, and, happy, for, birthday, thanks, is, have, in | happy, birthday, thanks, congrats, mahesh, guys, day, blessed, sunil, good
Sender_026 | to, is, and, of, the, with, you, she, we, from | guys, drive, may, people, even, need, many, happy, sunil, congratulations
Sender_027 | the, to, and, of, for, is, you, this, in, happy | happy, wish, https, register, bday, event, time, sunil, egmp, iimbaa
Sender_028 | happy, birthday, the, nice, and, you, to, have, best, great | happy, birthday, nice, best, great, congratulations, thanks, wishes, year, sunil
Sender_029 | the, and, of, to, you, happy, is, it, thanks, many | happy, thanks, many, birthday, day, dr, god, bless, hospital, sunil
Sender_030 | the, to, you, and, of, it, for, in, many, we | many, happy, wish, please, birthday, good, day, us, would, used
Sender_031 | happy, birthday, to, the, and, you, in, of, for, this | happy, birthday, congratulations, thank, mahesh, help, please, need, sridhar, best
Sender_032 | happy, next, have, congratulations, to, you, birthday, ketan, be, and | happy, next, congratulations, birthday, ketan, thanks, vishnu, time, bangalore, sunil
Sender_033 | the, to, is, and, of, in, are, be, for, https | https, youtu, people, india, market, like, us, one, get, com
Sender_034 | and, the, to, you, happy, of, for, is, this, in | happy, birthday, best, wishes, know, please, thank, proud, congratulations, thanks
Sender_035 | happy, birthday, and, for, you, congratulations, to, year, annuity, the | happy, birthday, congratulations, year, annuity, dear, one, mahesh, thanks, ahead
Sender_036 | the, to, and, of, is, in, for, it, you, are | wishes, birthday, mask, year, great, ahead, https, india, com, one
Sender_037 | happy, many, returns, and, thank, you, all, to, wish, sunil | happy, many, returns, thank, wish, sunil, great, mahesh, year, day
Sender_038 | you, many, thank, to, of, the, in, is, happy, returns | many, thank, happy, returns, day, city, please, picasso, wishing, congratulations
Sender_039 | the, and, to, in, of, is, are, https, this, for | https, com, market, one, happy, birthday, congratulations, counterargument, youtu, best
Sender_040 | this, to, was, com, you, old, with, https, shoes, of | com, old, https, shoes, message, deleted, app, file, attached, mars
Sender_041 | you, thank, and, are, this, of, to, have, it, for | thank, guys, https, www, friends, done, end, work, sorry, year
Sender_042 | to, the, of, very, in, this, is, sunil, you, it | sunil, sunitha, thanks, product, way, go, much, batch, india, registered
Sender_043 | thank, you, for, the, sunil, sridhar, saurabh, vinod, everyone, and | thank, sunil, sridhar, saurabh, vinod, everyone, making, day, special, vikal
Sender_044 | in, to, the, this, for, of, was, from, you, it | please, message, deleted, com, congrats, years, congratulations, looking, mail, share
Sender_045 | thanks, am, with, thank, you, sunil, and, raman, been, very | thanks, thank, sunil, raman, long, sure, hello, everyone, got, guys
Sender_046 | congrats, that, and, file, attached, this, congratulations, best, awesome, thanks | congrats, file, attached, congratulations, best, awesome, thanks, lot, friends, wishes
Sender_047 | the, to, all, sunil, and, one, congratulations, with, for, you | sunil, one, congratulations, thanks, room, shalini, possible, happy, group, best
Sender_048 | many, happy, the, returns, of, day, sunil, you, all, wish | many, happy, returns, day, sunil, wish, guru, cnn, congrats, sunita
Sender_049 | to, the, we, all, is, this, of, and, signal, in | signal, pl, thanks, group, https, congrats, happy, join, one, best
Sender_050 | to, the, in, and, is, of, for, not, this, it | may, also, one, like, think, good, us, india, please, even
Sender_051 | to, the, you, happy, in, of, for, is, birthday, good | happy, birthday, good, thanks, god, friends, bless, mahesh, sunil, group
Sender_052 | to, and, the, all, you, of, congratulations, very, happy, awesome | congratulations, happy, awesome, mmhrotd, mahesh, many, great, good, stay, congrats
Sender_053 | the, to, and, in, is, of, it, for, you, with | https, sea, one, happy, birthday, com, file, attached, good, img
Sender_054 | this, message, was, deleted | message, deleted
Sender_055 | is, the, and, in, to, of, for, be, with, all | help, thanks, looking, please, take, hospital, experience, also, guys, need
Sender_056 | happy, to, you, and, birthday, the, of, thank, all, parenting | happy, birthday, thank, parenting, course, amazing, many, returns, day, vani
Sender_057 | the, and, to, in, of, is, this, was, for, india | india, good, thanks, biju, indonesia, patnaik, wishes, dutch, lot, know
Sender_058 | thanks, lot, for, the, wishes | thanks, lot, wishes
Sender_059 | is, supply, chain, microsoft, the, and, for, this, in, or | supply, chain, microsoft, india, hyderabad, coe, hiring, cloud, growth, capacity
Sender_060 | birthday, happy, and, have, year, ahead, to, great, congratulations, the | birthday, happy, year, ahead, great, congratulations, wishes, best, nice, mahesh
Sender_061 | happy, birthday, have, aparna, great, year, does, anyone, know, covid | happy, birthday, aparna, great, year, anyone, know, covid, recovered, person
Sender_062 | with, of, you, on, to, are, the, all, thanks, this | thanks, good, dr, tainwala, based, butters, hi, connect, wow, vikas
Sender_063 | the, to, it, and, but, of, for, is, have, this | happy, good, mahesh, congratulations, would, one, like, ilango, even, quite

What themes dominate the group’s conversation?¶

No description has been provided for this image

Navigating the Memory Lane¶

Let's identify and visualize “throwback-triggering” messages — those that evoke nostalgia or reference shared past experiences. This helps us analyze collective sentiment, uncover moments of shared memory, and understand how nostalgia emerges and evolves in group chats over time. Such insights can reveal social bonding patterns, especially around events like school reunions, college anniversaries, old trips, or festive reflections, offering a deeper view into how the group recalls and relives its collective past.

To achieve this, we employ two complementary approaches:

  • Throwback Detection Using Defined Keywords
  • Keyword-Agnostic Throwback Detection Using NLP (semantic similarity)

Can we identify throwback posts using keywords?¶

This approach relies on a predefined list of nostalgic or memory-related terms (such as "remember", "trip", "college", "reunion", or "old days") to identify messages that likely reference past events. The terms are compiled into a single regex pattern, which is then run over the chat dataset to select the messages that explicitly contain any of these words or phrases.

This approach is valuable for its simplicity and high precision: when users directly mention known nostalgic triggers, the system confidently classifies those messages as throwbacks. It allows quick insight into how often and when group members recall shared memories, providing a clear, quantifiable signal for analyzing emotional engagement and social bonding over time. However, it may miss subtler, indirect references to the past, which is where NLP-based methods complement this approach.
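
The keyword scan described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the term list contains only the five examples named in the text, and `find_throwbacks` and the sample messages are hypothetical.

```python
import re

# Hypothetical keyword list: only the examples named in the text above.
THROWBACK_TERMS = ["remember", "trip", "college", "reunion", "old days"]

# Word boundaries keep "trip" from matching inside "stripes";
# IGNORECASE lets "Remember" match "remember".
THROWBACK_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in THROWBACK_TERMS) + r")\b",
    re.IGNORECASE,
)

def find_throwbacks(messages):
    """Return the (sender, text) pairs whose text contains any keyword."""
    return [(sender, text) for sender, text in messages
            if THROWBACK_PATTERN.search(text)]

# Made-up sample data for illustration only.
sample = [
    ("Sender_001", "Remember our Goa trip?"),
    ("Sender_002", "Happy birthday!"),
    ("Sender_003", "Those were the old days"),
    ("Sender_004", "Nice stripes on that shirt"),
]
hits = find_throwbacks(sample)
print("Total throwback-triggering messages:", len(hits))
print("Total users who shared throwbacks:", len({s for s, _ in hits}))
```

Matching whole words trades a little recall (for example, "remembering" is not matched) for fewer false positives; whether the original pattern makes the same choice is not visible in this export.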

Total throwback-triggering messages: 471
Total users who shared throwbacks: 47

Can NLP uncover nostalgic messages?¶

Keyword lists only catch explicit mentions. To find subtler references to the past, the Keyword-Agnostic Throwback Detection approach uses natural language processing (NLP) instead. Each message is scored on contextual cues, temporal language patterns, and semantic similarity to throwback-style language, using language models such as BERT or RoBERTa, so a message can be flagged even when it contains none of the listed keywords.

This makes it possible to detect subtle and implicit throwback mentions that static keyword filters often miss, giving a fuller picture of how past topics resurface in the group's current conversations and how sentiment around them shifts.
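
The scoring idea can be sketched as below, under loud assumptions. The anchor phrases, the 0.45 threshold, and all function names are hypothetical, and `toy_embed` is a bag-of-words stand-in so the example runs without downloading a model. A real run would pass a BERT-style sentence encoder as `embed` and compare dense vectors instead.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedder (assumption): sparse bag-of-words counts.
    A real pipeline would return a dense vector from a language model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def throwback_scores(messages, anchors, embed=toy_embed):
    """Score each message by its best similarity to any anchor phrase."""
    anchor_vecs = [embed(a) for a in anchors]
    return [max(cosine(embed(m), av) for av in anchor_vecs) for m in messages]

# Hypothetical anchors and cut-off, chosen only for this illustration.
ANCHORS = ["nice memories", "brings back memories", "those were the days"]
THRESHOLD = 0.45

msgs = ["nice memories", "meeting at 5 pm", "this brings back so many memories"]
scores = throwback_scores(msgs, ANCHORS)
flagged = [m for m, s in zip(msgs, scores) if s >= THRESHOLD]
print(flagged)
```

The number of messages flagged depends heavily on the threshold and on the anchor phrases, so counts from this approach and from the keyword approach are not directly comparable.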

Total throwback-triggering messages: 25
Total users who shared throwbacks: 15

 Sample Throwback Messages with Scores:
         SenderID                                      Clean_Message  throwback_score             Datetime
10832  Sender_033                                    we have history         0.453242  2020-08-20 20:13:00
9404   Sender_035                                      nice memories         0.661112  2020-06-28 10:01:00
12909  Sender_006    nice to see all together! brings back memories!         0.518130  2020-11-22 12:01:00
9401   Sender_029                                    nice memories..         0.668752  2020-06-28 10:00:00
16969  Sender_062  wow! dont remember this pic. thanks for sharin...         0.464323  2021-07-05 10:30:00

💡Data Insights

The textual content generated by the group offers a rich foundation for analyzing communication dynamics, sentiment, and engagement patterns. The following breakdown explores the volume, structure, and nature of messages to derive actionable insights that support community understanding and strategic decision-making.

  • Textual Data Volume & Quality

    The group has produced a substantial volume of messages, forming a robust dataset for linguistic, thematic, and behavioral analysis. Through comprehensive preprocessing steps—including stopword removal, normalization, and text cleaning—the analysis focused on high-signal, meaningful content, increasing the accuracy and interpretability of the results.

  • Word Frequency & Thematic Trends

    Analysis of the most frequent terms (excluding generic stopwords) highlights a strong presence of positively charged, socially supportive, and professionally oriented language. Common words such as “thanks,” “good,” “congratulations,” “happy,” and “connect” suggest a culture of gratitude and recognition, ongoing professional networking and milestone acknowledgments, and social cohesion built through consistent positive reinforcement.

  • Message Length & Communication Style

    The group demonstrates a balanced communication pattern, blending short-form messages (for quick acknowledgments, emojis, or one-word affirmations) and longer posts (for detailed updates, reflections, and context-rich discussions). This dynamic allows for both efficiency and depth, accommodating various communication preferences and engagement levels.

  • User-Level Communication Patterns

    Top contributors by average message length often lead extended conversations, share resources, or provide detailed input, playing central roles in shaping group dialogue. Brief communicators typically offer quick responses, indicating efficient, mobile-first interaction styles, which are still vital for maintaining group momentum. Additionally, user-specific keyword analysis reveals personalized communication styles (e.g., formal vs. casual tone), content preferences (e.g., medical topics, social events, mentorship), and emerging influencer roles, such as conversation starters and frequent responders.

  • Keyword & Topic Variety

    The diversity in top keywords and message types points to a broad range of discussion themes, including professional updates, peer recognition and celebration, event planning and coordination, and social bonding and informal interaction. This content richness reflects a multifaceted community identity that combines value-driven professional exchange with warm interpersonal connections.

  • Sentiment Signals

    Frequent use of emotionally positive words reinforces the presence of a supportive and encouraging group culture. This language pattern signals:

    • A community grounded in mutual respect
    • A tendency toward celebrating shared success
    • High levels of peer appreciation and motivation
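
The preprocessing and word-frequency steps mentioned in the insights above can be sketched roughly as follows. The stopword set here is a tiny illustrative subset (a real run would likely use a fuller list such as NLTK's), and `tokenize`, `top_words`, and the sample messages are hypothetical names and data, not the notebook's own.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (assumption); real lists are much longer.
STOPWORDS = {"the", "to", "and", "of", "is", "in", "for", "you", "a", "it"}

def tokenize(text):
    """Lowercase and keep alphabetic runs only (drops digits, punctuation, emoji)."""
    return re.findall(r"[a-z]+", text.lower())

def top_words(messages, n=3, drop_stopwords=True):
    """Return the n most frequent tokens across the given messages."""
    counts = Counter()
    for text in messages:
        tokens = tokenize(text)
        if drop_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n)]

# Made-up messages for illustration only.
msgs = [
    "Happy birthday to you!",
    "Thanks to you for the wishes",
    "Happy to help, thanks",
]
print(top_words(msgs, drop_stopwords=False))  # raw counts, stopwords included
print(top_words(msgs))                        # after stopword removal
```

Running both variants side by side mirrors the two keyword columns in the per-sender table: the raw list is dominated by function words, while the filtered list surfaces the greeting and gratitude vocabulary.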

Sentiment & Emotion Detection¶


What types of sentiments are present in messages?¶

🔍 Analytical Insights

For sentiment analysis, we use VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based sentiment analysis tool designed for sentiment expressed in social media and other short text. Rather than being trained on labeled data, VADER relies on a lexicon of sentiment-related words plus rules that handle punctuation, capitalization, degree modifiers (like "very"), and slang. It outputs four scores: positive, negative, neutral, and a compound score (overall sentiment normalized to the range -1 to +1). VADER is lightweight, fast, and works well out of the box for texts like tweets, reviews, and chat messages.
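
A minimal sketch of how a compound score can be turned into a sentiment label. The ±0.05 cut-offs are the ones conventionally used with VADER; that the notebook uses exactly these is an assumption, and `label_sentiment` is a hypothetical helper name.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a label.
    The 0.05 threshold is the conventional default (assumed here)."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Compound scores would normally come from the vaderSentiment package, e.g.:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
# To stay dependency-free, this sketch labels a few literal scores instead.
for score in (0.9815, 0.4215, 0.0, -0.6):
    print(score, label_sentiment(score))
```

Under this rule, a message with no lexicon hits scores exactly 0.0 and lands in the Neutral bucket, which is why media placeholders and non-English text tend to be labeled Neutral.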

Message Sentiment_Score Sentiment_in_Message
0 One of the most disturbing stories that we fin... 0.9815 Positive
2 Is this True? 0.4215 Positive
3 Best opposition leader to remain in opposition 😝 0.6369 Positive
4 Don’t know but really funny. 0.6474 Positive
5 Sunil and Ramki can clarify if it’s true 0.4215 Positive
... ... ... ...
28666 Hey guys, So how many of us are coming to IIM... 0.0000 Neutral
28667 <Media omitted> *Friendship meets inspiration ... 0.9767 Positive
28668 Dear all …a query around engg admissions in Bl... 0.7013 Positive
28669 <Media omitted> Kela beku neevu. 👌🙏🌹😊 0.0000 Neutral
28670 <Media omitted> *Wellness Beyond Buzzwords. Wh... 0.9001 Positive

24234 rows × 4 columns

Which users post the most emotional messages?¶

No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

What’s the emotional profile of each user?¶

No description has been provided for this image

How is sentiment distributed among users?¶

No description has been provided for this image

How many sentiments are detected per message?¶

No description has been provided for this image

💡Data Insights

The sentiment analysis categorized messages as positive, neutral, or negative, using both textual and emoji/emoticon cues to capture emotional nuance.

  • Positive Sentiment: A substantial portion of messages reflected appreciation, encouragement, and celebration, indicating a supportive and upbeat group culture.
  • Neutral Sentiment: The majority of messages were neutral, emphasizing information exchange, coordination, and professional updates—aligning with the group's purpose-driven nature.

  • Negative Sentiment: Very few messages contained negative sentiment, underscoring a low-conflict environment with a strong sense of psychological safety.

  • Temporal Sentiment Trends: Analysis over time uncovered notable patterns:

    • Consistent Positivity: Peaks in positive sentiment aligned with achievements, celebrations, recognitions, and key group milestones.
    • Event-Driven Variability: Short-term increases in mixed or negative sentiment occasionally occurred in response to external events or difficult discussions, but these were rare and typically followed by a quick return to a positive baseline.
  • Emotion Detection & Expression Patterns

    • Emoji & Emoticon Use: The group frequently used emojis and emoticons, offering insight into non-verbal emotional expression. Emotions such as happiness, excitement, and approval were prevalent, evidenced by emojis like 😊, 🎉, 🙌, and 👍. Less frequent—but meaningful—emotional nuance (e.g., expressions of empathy, surprise, or concern) often appeared in response to personal updates, group concerns, or broader events.

    • Text-Based Emotion Signals: Emotion detection in written content supported the emoji-based findings. Recurring use of phrases like “congratulations,” “thank you,” “well done,” and “happy to connect” reinforced a tone of positivity and mutual encouragement. Textual emotion cues suggested high levels of community warmth, trust, and camaraderie.

  • Key Influencers & Positivity Drivers:
    A subset of members consistently shared motivational, celebratory, or appreciative messages, serving as informal leaders and morale boosters. These users help set the tone for group culture and are central to sustaining positivity and momentum.

  • Engagement Health Indicators:
    The combination of low negative sentiment, frequent emotional validation, and broad participation in positivity indicates strong group cohesion, emotional resilience, and a culture of inclusivity and psychological safety.

Emoji Usage & Personality¶


What emojis are used most frequently?¶

No description has been provided for this image

🔍 Analytical Insights

Emojis with skin tone modifiers, such as 👍🏻 (Thumbs Up: Light Skin Tone) and 👍🏽 (Thumbs Up: Medium Skin Tone), are encoded as distinct Unicode sequences (the base emoji followed by a skin tone modifier code point), not as purely stylistic variations of the default emoji. This analysis lists them separately because doing so preserves the intentional choices users make to reflect identity, inclusivity, or cultural expression.

Aggregating these variants under the base emoji would obscure meaningful behavioral insights and distort usage patterns. By treating them as separate entries, analysts can accurately capture user intent, identify trends in personalization, and ensure a more inclusive and representative understanding of digital communication habits.
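
The difference between the two counting strategies can be illustrated with a short sketch. It assumes a known set of base emojis (here only thumbs up) rather than full emoji detection, and `count_emoji` is a hypothetical helper, not the notebook's function.

```python
from collections import Counter

# Fitzpatrick skin tone modifiers occupy U+1F3FB through U+1F3FF.
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)}
THUMBS_UP = "\U0001F44D"

def count_emoji(messages, base_emojis, merge_skin_tones=False):
    """Count occurrences of the given base emojis.
    By default a base emoji followed by a skin tone modifier is counted
    as its own entry; merge_skin_tones=True folds it into the base."""
    counts = Counter()
    for text in messages:
        i = 0
        while i < len(text):
            ch = text[i]
            if ch in base_emojis:
                token = ch
                if i + 1 < len(text) and text[i + 1] in SKIN_TONE_MODIFIERS:
                    if not merge_skin_tones:
                        token += text[i + 1]
                    i += 1  # consume the modifier either way
                counts[token] += 1
            i += 1
    return counts

light = THUMBS_UP + "\U0001F3FB"   # thumbs up, light skin tone
medium = THUMBS_UP + "\U0001F3FD"  # thumbs up, medium skin tone
msgs = [light + " great", THUMBS_UP + " ok " + medium, light]

separate = count_emoji(msgs, {THUMBS_UP})
merged = count_emoji(msgs, {THUMBS_UP}, merge_skin_tones=True)
print(len(separate), "distinct entries when kept separate")
print(len(merged), "distinct entry when merged")
```

Keeping the variants separate yields three entries for what a merged count reports as a single emoji used four times, which is the trade-off the paragraph above describes.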

What relationships exist between emojis?¶

No description has been provided for this image

How do emoji sentiments change over time?¶

No description has been provided for this image

What can emojis tell us about message mood?¶

No description has been provided for this image
No description has been provided for this image

Which emojis are the happiest or saddest?¶

Loading emoji sentiment data from CSV...

 User Personality Profiles from Emoji Use:
     SenderID Top_Emoji_1 Top_Emoji_2 Top_Emoji_3   Personality_Type
0  Sender_003      ⚘⚘⚘⚘⚘⚘        None        None            Neutral
1  Sender_004           ➙        None        None            Neutral
2  Sender_005           ☬        None        None            Neutral
3  Sender_007           ✓        None        None           Positive
4  Sender_023          ✓✓           ✓         ✓✓✓  Neutral, Positive
5  Sender_056           ⚘        None        None            Neutral

💡Data Insights

The group demonstrates a high frequency and diversity of emoji usage, with members regularly incorporating emojis into their messages. This indicates a digitally fluent and expressive communication culture.

  • Top Emojis: The most frequently used emojis include positive and celebratory symbols (e.g., 👍, 😊, 🎉, 🙏), which align with the group’s overall positive sentiment and culture of appreciation.
  • Contextual Use: Emojis are used to reinforce tone, convey non-verbal cues, and add emotional nuance to both professional and social messages. This enhances clarity and reduces the risk of misinterpretation in text-based communication.
  • Expressiveness and Openness: Frequent emoji users tend to be more expressive, open, and approachable. Their messages often set a friendly and inclusive tone, encouraging broader participation and engagement.
  • Positive Reinforcement: The use of emojis such as thumbs up, clapping hands, and smiley faces is strongly associated with encouragement, recognition, and support. This fosters a psychologically safe environment where members feel valued.
  • Individual Communication Styles: Analysis of emoji usage by individual members reveals distinct personality traits:
    • Enthusiasts: Members who use a wide variety of emojis, often in combination, are typically seen as energetic, creative, and socially active.
    • Minimalists: Members who use emojis sparingly may prefer direct, concise communication, reflecting a more reserved or task-focused personality.
    • Bridge Builders: Some members use emojis strategically to bridge professional and personal topics, facilitating smooth transitions and maintaining group cohesion.
  • Business and Community Implications
    • Enhanced Engagement: The group's rich emoji culture contributes to higher engagement, as members feel more connected and understood. This is particularly valuable in remote or asynchronous professional communities.
    • Cultural Sensitivity: The choice of emojis reflects cultural norms and shared values within the group, such as respect (🙏), celebration (🎉), and positivity (😊). This strengthens group identity and belonging.
    • Communication Efficiency: Emojis enable quick, effective communication of emotions and reactions, reducing the need for lengthy explanations and streamlining group interactions.

Media & File Sharing¶


What types of files are shared?¶

--- File Types Found (Excluding .txt) ---
.jpg: 487 file(s) - Images shared in the group
.mp4: 109 file(s) - Videos shared (memes, events, recordings)
.vcf: 18 file(s) - Contact cards shared
.opus: 6 file(s) - Voice messages
.webp: 13 file(s) - Stickers or compressed images
.pdf: 7 file(s) - Documents such as brochures, notes, etc.
.xlsx: 1 file(s) - Excel files, likely data or reports
.csv: 2 file(s) - Comma-separated data files
no_extension: 1 file(s) - No description available

When are files most frequently shared?¶

No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Are links shared regularly?¶

No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

How large are the shared files?¶

No description has been provided for this image

What time of day are files most shared?¶

No description has been provided for this image
No description has been provided for this image

What kinds of images do people share, and can we figure that out automatically?¶

✅ Loading cached prediction results...
               Media_Filename Image_Cat_Predicted  Prediction_Confidence Image_Cat_from_Content
407   IMG-20190515-WA0006.jpg               Event                  0.424                  Event
1251  IMG-20190708-WA0002.jpg        Announcement                  0.207           Announcement
2774  IMG-20191009-WA0008.jpg                Meme                  0.348                   Meme
3017  IMG-20191016-WA0005.jpg               Event                  0.749                  Event
3079  IMG-20191018-WA0010.jpg               Event                  0.964                  Event

What themes are common in shared images?¶

--- Common Themes from Images ---

Cluster 1 (78 images):
  • a man in a yellow shirt and a black shirt with a face mask on
  • a man with a light on his face
  • a man with a mohawk mohawk and a suit
Cluster 2 (102 images):
  • a tweet tweet tweet tweet tweet tweet t
  • a text that reads, ` ` ' ' ' ' ' ' ' ' ' ' ' ' '
  • a screenshot of a text message from person
Cluster 3 (37 images):
  • a cartoon depicting the driver ' s face and the driver ' s face
  • a cartoon of two people riding on a motorcycle
  • a cartoon of a doctor talking to a patient
Cluster 4 (59 images):
  • a man and woman sitting at a table with a cup of coffee
  • a baby is being held by a woman in a hospital
  • a poster with a bunch of people on it
Cluster 5 (211 images):
  • happy diwali diya diya diya diya diya diya diya diya
  • a quote that says i don ' t know it ' s not the worst
  • mes, no one going up the ta that tree of bollywood ' s most popular villains are literally

Can we group images based on predicted captions?¶

No description has been provided for this image

💡Data Insights

The caption embeddings are grouped into a predefined number of clusters (e.g., 5) using the KMeans algorithm, which assigns each image to a single cluster based on the semantic similarity of its caption to others. This results in distinct image categories, where each cluster represents a dominant theme or concept derived from the captions.
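
The clustering step can be sketched with a small pure-Python version of Lloyd's algorithm, which is what KMeans implementations run at their core. This is a dependency-free stand-in, not the notebook's code (which most likely calls a library implementation such as scikit-learn's `KMeans`); the 2-D points stand in for real caption embeddings, and the function name and parameters are hypothetical.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=50):
    """Lloyd's algorithm: return one cluster label per point."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy 2-D "embeddings": two well-separated groups (illustration only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans(points, k=2)
print(labels)
```

Because every image gets exactly one label, captions that mix themes (for example, a festive greeting that is also a screenshot) are forced into whichever cluster centre is closest, which helps explain why some clusters look heterogeneous.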

The "Media & File Sharing" analysis demonstrates that the WhatsApp group has successfully leveraged multimedia content to create a rich, engaging, and informative communication environment. This approach not only enhances day-to-day interactions but also supports professional development, knowledge sharing, and community building. By continuing to encourage quality media sharing and recognizing key contributors, the group can sustain its dynamic and valuable digital community.

  • Media Usage Patterns and Volume

    • Diverse Media Types: The group demonstrates a rich and varied media sharing culture, with members regularly sharing images, videos, documents, and other file types. This indicates a dynamic and resource-rich communication environment.
    • Volume and Frequency: The analysis reveals significant media sharing activity, with members leveraging visual and multimedia content to enhance their messages and provide context. This suggests a preference for rich, engaging communication over text-only interactions.
  • Content Categories and Themes

    • Professional Content: A substantial portion of shared media includes professional documents, presentations, and informational graphics. This reflects the group's focus on knowledge sharing, collaboration, and professional development.
    • Social and Celebratory Content: Images and videos related to events, celebrations, and personal milestones are frequently shared, indicating a strong social bond and culture of recognition within the group.
    • Educational and Informational: Members share educational content, news articles, and informational videos, demonstrating a commitment to continuous learning and staying informed about relevant topics.
  • User Behavior and Engagement

    • Active Contributors: Certain members emerge as key media contributors, regularly sharing high-quality and relevant content. These individuals play a crucial role in maintaining group engagement and providing valuable resources.
    • Engagement Drivers: Media-rich messages typically draw more engagement (replies and follow-up messages) than text-only messages, highlighting the importance of visual content in driving interaction and participation.

  • Strategic Business Implications

    • Enhanced Communication Effectiveness: The group's media sharing culture enhances communication effectiveness by providing visual context, reducing ambiguity, and making complex information more accessible and engaging.
    • Knowledge Management: The regular sharing of documents and informational content supports collective knowledge building and ensures that important information is widely accessible to all members.
    • Community Building: Social and celebratory media content strengthens group bonds, fosters a sense of belonging, and creates shared memories that enhance group cohesion.

Mention & Email Analysis¶


Who mentions others most frequently?¶

No description has been provided for this image

What email domains are most common?¶

No description has been provided for this image

Demographic Patterns¶

This section explores the demographic composition of group participants based on available attributes, including age, gender, city, country, and highest education level. Analyzing these factors helps uncover patterns in group diversity, participation trends across different demographic segments, and potential correlations between user characteristics and messaging behavior. These insights provide valuable context for interpreting group dynamics and tailoring engagement strategies.

🚫 Limitations

In the absence of additional user-provided details, this analysis estimates the geographic distribution of users based solely on the country codes extracted from their phone numbers. While this provides a reasonable approximation of user location, it may not reflect actual residency or physical presence.
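
The country-code estimate described above can be sketched as a prefix lookup. The table below is a small illustrative subset of international calling codes (a real run needs the full list), numbers are assumed to be in international format, and `country_from_phone` is a hypothetical helper name.

```python
# Illustrative subset of international calling codes (assumption: the
# notebook uses a fuller table or a phone-number library).
CALLING_CODES = {
    "1": "United States/Canada",
    "44": "United Kingdom",
    "61": "Australia",
    "65": "Singapore",
    "91": "India",
    "971": "United Arab Emirates",
}

def country_from_phone(number):
    """Estimate a country from a number in international format.
    Calling codes are 1 to 3 digits long, so try the longest prefix first."""
    digits = "".join(ch for ch in number if ch.isdigit())
    for length in (3, 2, 1):
        country = CALLING_CODES.get(digits[:length])
        if country:
            return country
    return "Unknown"

# Made-up numbers for illustration only.
for phone in ("+91 98765 43210", "+1 (415) 555-0100", "+971 50 123 4567"):
    print(phone, "->", country_from_phone(phone))
```

As the limitation above notes, this reflects where a number was issued, not where its owner currently lives.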

Where are users geographically located?¶

No description has been provided for this image

💡Data Insights

  • Mentions per SenderID

    • Sender_036 and Sender_011 are the most frequently mentioned individuals in the group (15 and 14 mentions respectively), suggesting high visibility or central roles in discussions.

    • A long tail of users with fewer mentions indicates a typical core-periphery structure, where a few individuals are frequently acknowledged and many contribute sporadically.

    • Mentions may correlate with leadership or organizational roles, contributions to shared knowledge or group events, and social or professional influence within the group.

  • Most Frequently Shared Email Addresses

    A small number of email addresses are shared repeatedly, indicating trusted points of contact or recurring professional exchange (e.g., resource sharing, onboarding, event organization). Such email addresses are most likely shared for networking or collaboration, professional inquiries or follow-ups, and event logistics or coordination.

  • Most Shared Email Domains

    gmail.com dominates, reflecting the use of personal email accounts for communication and sharing. iimb.ac.in stands out as the most shared institutional domain, reinforcing the academic or alumni context of the group. Other domains like bridgepeople.in, tresorfit.com, zopperinsurance.com, and iimbaa.club suggest entrepreneurial activity, organizational affiliation, and ongoing professional ventures tied to group members.
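
A rough sketch of how shared email domains can be tallied from message text. The regex is a pragmatic approximation rather than a full address validator, and `domain_counts` plus the sample messages (with placeholder local parts) are hypothetical.

```python
import re
from collections import Counter

# Pragmatic pattern (assumption): local part, "@", then a dotted domain.
# The capture group returns only the domain.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
)

def domain_counts(messages):
    """Count how often each email domain appears across the messages."""
    counts = Counter()
    for text in messages:
        counts.update(domain.lower() for domain in EMAIL_RE.findall(text))
    return counts

# Made-up messages; the local parts are placeholders.
msgs = [
    "Please write to first.person@gmail.com for details",
    "Send your CV to second.person@IIMB.ac.in.",
    "cc third.person@gmail.com",
]
counts = domain_counts(msgs)
print(counts.most_common())
```

Lowercasing the domain before counting keeps `IIMB.ac.in` and `iimb.ac.in` in the same bucket, since domain names are case-insensitive.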

